Abstract



Fast evolution of Internet technologies has led to an explosive growth of 
video data available in the public domain and created unprecedented chal- 
lenges in the analysis, organization, management, and control of such con- 
tent. The problems encountered in video analysis such as identifying a video 
in a large database (e.g. detecting pirated content in YouTube), putting to- 
gether video fragments, finding similarities and common ancestry between 
different versions of a video, have analogous counterpart problems in ge- 
netic research and analysis of DNA and protein sequences. In this paper, 
we exploit the analogy between genetic sequences and videos and propose 
an approach to video analysis motivated by genomic research. Representing 
video information as video DNA sequences and applying bioinformatic algorithms
makes it possible to search, match, and compare videos in large-scale databases.
We show an application for content-based metadata mapping between ver- 
sions of annotated video. 



1 Introduction 



Today, the amount of video content available in the public domain is huge, ex- 
ceeding millions of hours, and is rapidly growing. Similar growth characterizes 
video-related metadata such as subtitle tracks and user-generated annotations and 
tags. However, these two types of information belong to two separate and largely 
unbridged domains. For example, English subtitles available on a DVD version of 
the Godfather movie are hard-wired to the timeline of the DVD video and cannot 
be used with a different version of the movie, e.g. downloaded from Bittorrent, 
streamed from YouTube, or broadcast over the air, which has a different timeline. 
Similarly, user-generated annotations and comments of a YouTube fragment of 
the Godfather are not accessible to a user watching the movie on DVD. 

One way to reconcile the timelines of different versions of a video and
the associated metadata is content-based synchronization. For this purpose,
a time-dependent signature is computed for each video, which makes it possible
to match and align similar parts in different versions of the video, thus giving a translation
from one system of time coordinates to another. In a prototype application consist- 
ing of a client and server, the signature is computed in real-time during the video 
playback on the client side and sent to the server where it is matched to a database 
of video signatures. After having established the correspondence to a database 
sequence, the corresponding metadata on the server side is sent to the client. With 
this approach, it is sufficient to keep a database of video signatures computed 
from some prototype sequence with synchronized metadata. A new version of 
the video, previously unseen and coming from any source (e.g. read-only media, 
streaming, etc.), can be matched to the prototype timeline and the corresponding 
metadata retrieved. Thus, at least theoretically, any video can be enriched with 
metadata, provided that similar videos have signatures in the database. 

The described application poses some requirements on the signature construc- 
tion and matching algorithms. First, they should be able to handle large amounts 
(thousands or millions of hours) of data. This, in turn, imposes the requirements 
that the signature is compact, easily indexable, and can be searched and matched 
fast. Secondly, the signature computation should be efficient, and ideally
performed in real time. Finally and most importantly, two versions of a video may
differ significantly due to post-production processing (e.g. resolution and aspect
ratio change, cropping, color and contrast modification, overlay of logos,
compression artifacts, blur, etc.) and editing (e.g. advertisement insertion or adap- 
tation of a movie for a certain rating category). The signature matching algorithm 
must therefore be able to cope with such modifications. 

Surprisingly, similar problems are encountered in the apparently unrelated field
of genetic research, where one of the main problems is the matching of DNA and
protein sequences. Many recent efforts, including the well-known Gene Bank and
Human Genome projects, have produced large collections of annotated DNA
and protein sequences, in which newly discovered sequences can be looked up.
The problem of post-processing distortions and editing is analogous to mutations 
occurring in biological DNA sequences. The scale of genetic data is comparable to 
that of video sequences (for example, the human genome contains sequences with 
nearly 3 billion symbols [11]). Over the past decades, many efficient methods 
have been developed for the analysis of genomic sequences, giving birth to the 
field of bioinformatics [22]. 

In this paper, we borrow well-established bioinformatic methods for the analysis
of video, which can be treated similarly to DNA sequences, as shown
in Section 2. A prototype application considered and shown in the supplementary
materials is content-based metadata mapping between versions of video. The
central problem in this application is finding correspondences between video
sequences. In Section 3, we draw the analogy to genomic research, which allows
us to employ dynamic programming sequence alignment [23, 27] and its fast
heuristics [24, 1], as well as multiple sequence alignment and phylogenetic analysis
[19]. Exploring the analogy between mutations in genetic sequences and
post-production processing and editing in video, we propose in Section 4 a generative
approach for learning invariance to such mutations by means of metric learning.
We obtain a very compact representation (64 bits per second of video), which is
robust to video transformations and allows efficient indexing and search.
Section 5 presents experimental results demonstrating the robustness and efficiency
of the proposed approach in a variety of applications, including video retrieval
and alignment in a large-scale (1K hours) database. Finally, Section 6 concludes
the paper.

1.1 Related work 

The problem of metadata mapping addressed in this paper is intimately related to 
content-based copy detection and search in video [17, 6]. There, one tries to find 
copies of a video that has undergone modifications (whether intentional or not) 
that potentially make it very different visually from the original. This problem 
should be distinguished from action and event recognition [31, 3, 16], where the
similarity criterion is semantic. Broadly speaking, copy detection problems boil
down to invariant retrieval (finding a video invariant to a certain class of
transformations), whereas action recognition problems are problems of categorization
(recognizing a certain class of behaviors in video). To illustrate the difference,
imagine three video sequences: a movie-quality version of Star Wars, the same
version broadcast on TV with ad insertion and captured off screen with a camcorder,
and the lightsaber fight scene reenacted by amateur actors. The purpose of copy
detection is to say that the first and the second video sequences are similar; action
recognition, on the other hand, should find similarity between the second and third
videos.

One of the cornerstone problems in content-based copy detection and search
is the creation of a video representation that allows videos to be compared and
matched across versions. Different representations based on mosaic [12], shot
boundaries [10], motion, color, and spatio-temporal intensity distribution [ ], color
histograms [ ], and ordinal measure [9] were proposed. When considering large
variability of versions due to post-production modifications, methods based on
spatial [20, 21, 2] and spatio-temporal [15] points of interest and local descriptors
were shown to be advantageous [14]. In addition, these methods proved to be very 
efficient in image search in very large databases [26, 4]. More recently, Willems 
et al. [30] proposed feature-based spatio-temporal video descriptors combining 
both visual information of single video frames as well as the temporal relations 
between subsequent frames. 

One of the main disadvantages of existing video representations is a construc- 
tive approach to invariance to video transformations. Usually, the representation 
is designed based on quantities and properties of video insensitive to typical
transformations. For example, gradient-based descriptors [20, 2] are known to be
insensitive to illumination and color changes. Such a construction may often be
unable to generalize to other classes of transformations, or result in a suboptimal 
tradeoff between invariance and discriminativity. 

An alternative approach, adopted in this paper, is to learn the invariance from 
examples of video transformations. By simulating the post-production and editing
process, we are able to produce pairs of video sequences that are supposed
to be similar (differing up to a transformation) and pairs of sequences from
different videos that are supposed to be dissimilar. Such pairs are used as a training set for
similarity preserving hashing and metric learning algorithms [25, 13, 29] in order 
to create a metric between video sequences that achieves optimal invariance and 
discriminativity on the training set. 

2 Video DNA



Biological DNA data encountered in bioinformatic applications are long sequences
consisting of four letters (A, T, C, and G, representing the nucleotide bases of
the DNA molecule). Extending this example to our problem,
one can conceptually think of video as a sequence of visual information
units, which can be represented over some potentially very large alphabet of visual 
concepts, resulting in a sequence of "letters" (or visual nucleotides) which we call 
video DNA by analogy to genetic sequences. Video DNA sequencing, the process 
of creating a video DNA sequence out of a video, is performed by computing de- 
scriptors for each frame (or short sequence of frames) and arranging them on the 
video timeline (see Figure 1). 

In this paper, we used a feature-based representation following the standard
bag of features paradigm [26, 4]. First, each frame in the video is scaled down
to a horizontal resolution of 320, feature points are detected, and local image
descriptors are computed around these points using a modification of the speeded-up
robust features (SURF) [ ] feature detection and description algorithm (Figure 1, top).
The 450 strongest feature points are used. Each feature point is described by a
64-dimensional grayscale and a 16-dimensional color descriptor. Second, the local
descriptors are quantized using the k-means clustering algorithm, separately for the
grayscale and color feature descriptors, creating grayscale and color visual
vocabularies. Vocabularies of 2048 and 124 visual words are used for grayscale and color
descriptors, respectively. Each local feature descriptor is replaced by the index of
the nearest visual word in the vocabulary. Third, each frame is divided into four
quadrants with 10% overlap and a bag of features (histogram of visual words) is
computed in each quadrant. The four concatenated histograms yield a vector of size
$d = 4 \times (2048 + 124) = 8688$, which is used as the frame descriptor (Figure 1,
bottom). Fourth, the median of frame descriptors in fixed time intervals is computed,
creating the video DNA sequence. The intervals are of size $T$ with step $\Delta T$.
A typical choice is $T = 2$ sec and $\Delta T = 1$ sec.
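
The third and fourth steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in the experiments: the function names `frame_descriptor` and `video_dna` are hypothetical, features are assumed to be already quantized into `(x, y, word)` triples, and the "10% overlap" is interpreted here as each quadrant extending 5% of the frame size past the center lines.

```python
from statistics import median

def frame_descriptor(features, width, height, vocab_size, overlap=0.10):
    """Concatenate four per-quadrant histograms of visual words.

    features: iterable of (x, y, word), with word in range(vocab_size).
    The overlap handling is an assumption made for this sketch.
    """
    mx, my = overlap * width / 2, overlap * height / 2
    cx, cy = width / 2, height / 2
    # Quadrant bounds (x0, x1, y0, y1), each extended past the center lines.
    quads = [
        (0, cx + mx, 0, cy + my),
        (cx - mx, width, 0, cy + my),
        (0, cx + mx, cy - my, height),
        (cx - mx, width, cy - my, height),
    ]
    hist = [0] * (4 * vocab_size)
    for x, y, word in features:
        for q, (x0, x1, y0, y1) in enumerate(quads):
            if x0 <= x <= x1 and y0 <= y <= y1:
                hist[q * vocab_size + word] += 1
    return hist

def video_dna(frame_descriptors, fps, T=2.0, dT=1.0):
    """Coordinate-wise median of frame descriptors over windows of T seconds,
    taken every dT seconds."""
    win, step = int(round(T * fps)), int(round(dT * fps))
    nucleotides = []
    for start in range(0, len(frame_descriptors) - win + 1, step):
        window = frame_descriptors[start:start + win]
        nucleotides.append([median(column) for column in zip(*window)])
    return nucleotides
```

With a combined vocabulary of 2048 + 124 words, `frame_descriptor` returns a vector of length 4 x 2172 = 8688, which matches the descriptor size quoted above.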

The resulting video DNA is a timed sequence of $d$-dimensional bags of features,
which we call visual nucleotides by analogy to biological DNA sequences.
The similarity of two video sequences can be quantified by measuring the distance
between the corresponding visual nucleotides, which we denote here by $d_{\mathrm{nuc}}$.
In the simplest case, a Euclidean distance in $\mathbb{R}^d$ is used. In [26], it was shown
that a Euclidean distance weighted by the statistical distribution of visual words
(term frequency-inverse document frequency, or tf-idf) is a better way to compare
bags of features. We will address the construction of an optimal distance between
visual nucleotides in Section 4.




Figure 1: Construction of the visual nucleotides. Top: features detected in a 
video frame; bottom: corresponding bag of features. After applying
similarity-preserving hashing to the bag of features, the frame is represented by the 64-bit
binary word 223E9DF01ADB3E00. 



3 Search and alignment 

Dynamic programming methods used to align biological DNA sequences, notably
the Needleman-Wunsch (NW) [23] and Smith-Waterman (SWAT) [27] algorithms,
can be applied to finding correspondence between versions of video
sequences.

Let $x = (x_1, \ldots, x_M)$ and $y = (y_1, \ldots, y_N)$, with $x_i, y_j \in \mathbb{R}^d$,
be two video DNA sequences representing two versions of a video obtained by
temporal editing. In this case, $x$ and $y$ will typically have locally similar
sequences of nucleotides. In order to find such similarities, we look for an optimal
local alignment between $x$ and $y$, i.e., a correspondence between the indices
$\{1, \ldots, M\}$ and $\{1, \ldots, N\}$ that on one hand makes the corresponding
nucleotides the most similar and on the other contains gaps of minimum total length.
The quality of the correspondence is represented by a score that takes into
consideration both the similarity of the nucleotides and the gaps.

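The minimum dissimilarity score between the substring of $x$ of length $i$ and
the substring of $y$ of length $j$ is given by the following recursive equation,

$$
s_{i,j} = \min \begin{cases}
s_{i-1,j-1} + d_{\mathrm{nuc}}(x_i, y_j) & \text{(match)} \\
s_{i-1,j} + g(x_i) & \text{(deletion)} \\
s_{i,j-1} + g(y_j) & \text{(insertion)}
\end{cases} \qquad (1)
$$

where $i = 1, \ldots, M$, $j = 1, \ldots, N$, and $s_{i,0} = s_{0,j} = 0$ for all
$i = 0, \ldots, M$, $j = 0, \ldots, N$. $d_{\mathrm{nuc}}(a, b)$ is the dissimilarity
between nucleotides $a$ and $b$, and $g(a)$ is the gap penalty.
The values of $s$ are determined by means of dynamic programming and the optimal
correspondence is established by backtracking [27].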
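
A minimal sketch of such a dynamic programming alignment is given below. It assumes a recursion with three moves (match, deletion, insertion), a zero first row and column, and, as a simplification, a constant gap penalty rather than a nucleotide-dependent $g(a)$. The function name `align` and its arguments are hypothetical.

```python
def align(x, y, dist, gap=1.0):
    """Return (score, pairs) for sequences x and y.

    dist(a, b) is the dissimilarity between two nucleotides; gap is a
    constant gap penalty (an assumption made for this sketch).
    """
    M, N = len(x), len(y)
    # s[i][j]: minimum dissimilarity between the first i nucleotides of x
    # and the first j nucleotides of y; row 0 and column 0 stay at zero.
    s = [[0.0] * (N + 1) for _ in range(M + 1)]
    move = [[None] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            options = (
                (s[i - 1][j - 1] + dist(x[i - 1], y[j - 1]), "match"),
                (s[i - 1][j] + gap, "deletion"),
                (s[i][j - 1] + gap, "insertion"),
            )
            # On ties the first option wins, so a match is preferred.
            s[i][j], move[i][j] = min(options, key=lambda o: o[0])
    # Backtrack from the last cell to recover the matched index pairs.
    pairs = []
    i, j = M, N
    while i > 0 and j > 0:
        if move[i][j] == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "deletion":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return s[M][N], pairs
```

For example, aligning `[1, 2, 3, 4]` with `[1, 2, 4]` under a 0/2 mismatch cost and a unit gap penalty pairs the indices (0, 0), (1, 1), and (3, 2) and leaves the third element of the first sequence in a gap. Both time and memory grow as the product of the two sequence lengths.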

3.1 Fast heuristics 

The main disadvantage of dynamic programming alignment methods is their high
complexity of $O(NM)$. In our application, when a short sequence ($N$ of the order
of $10^3$ to $10^4$ for a typical movie, assuming $\Delta T = 1$ sec) is compared to a
large database containing signatures of thousands or millions of hours of video ($M$
of the order of $10^6$ to $10^9$), such an approach may be computationally prohibitive.
A similar complexity problem is encountered in gene search applications in
bioinformatics, where typical databases contain sequences totalling millions or
billions of letters.

To overcome this problem, fast heuristics such as FASTA [24] and BLAST [1]
have been developed. The key idea of these approaches is to first locate matches
of short combinations of nucleotides of fixed size $k$ (typically ranging between 2
and 10), which establish multiple coarse initial correspondences between regions
in the two sequences. Using search engine terminology, the initial correspondences
established by FASTA/BLAST algorithms form a short list of candidates. The
correspondence is later refined using a banded version of the SWAT algorithm, applied
to the sequences around the initial regions. At this stage, video DNA sequences
at higher temporal resolution can be used.
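
The seeding stage can be illustrated with a small inverted index over k-tuples of nucleotides. The sketch below assumes exact matching of k-tuples and voting by time offset, which is a simplification of what FASTA and BLAST actually do; `build_index` and `seed_candidates` are hypothetical names.

```python
from collections import Counter, defaultdict

def build_index(db, k):
    """Map every k-tuple of consecutive nucleotides to its start positions."""
    index = defaultdict(list)
    for pos in range(len(db) - k + 1):
        index[tuple(db[pos:pos + k])].append(pos)
    return index

def seed_candidates(query, index, k, top=3):
    """Vote for database offsets supported by exact k-tuple matches.

    Matches lying on the same diagonal (database position minus query
    position) support the same candidate alignment.
    """
    votes = Counter()
    for q in range(len(query) - k + 1):
        for pos in index.get(tuple(query[q:q + k]), ()):
            votes[pos - q] += 1
    return votes.most_common(top)
```

Each returned `(offset, votes)` pair is a short-list candidate. Only the database region around that offset would then be passed to the banded dynamic programming refinement.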

3.2 Multiple sequence alignment 

In many cases, it is desired to find alignment between more than two videos, 
a problem analogous to multiple sequence alignment (MSA) in bioinformatics. 
MSA is used in phylogenetic analysis [19], in order to discover evolutionary re- 
lations between DNA sequences. In video, a similar problem is version control, 
where multiple versions of a video are given and one wishes to establish, for ex- 
ample, from which source they were derived and which sequence was the original. 

Straightforward generalization of dynamic programming alignment algorithms
to MSA results in exponential complexity. For this reason, sub-optimal heuristics
such as progressive sequence alignment are used. For example, in CLUSTAL
[28], all pairs of sequences are first aligned separately. The alignment cost acts as a
measure of the pairwise sequence dissimilarity. Given the pairwise dissimilarity
matrix, a guide tree is constructed by means of clustering (e.g. neighbor joining).
Finally, a series of pairwise alignments following the branching order in the
tree is performed. This way, the most similar sequences are aligned first and the most
dissimilar last (for a detailed description of the algorithm, see [28]).
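
The role of the guide tree can be sketched as follows. For brevity, the example replaces neighbor joining with plain average-linkage agglomerative clustering, so it is a stand-in and not the CLUSTAL procedure itself; `guide_order` is a hypothetical name.

```python
def guide_order(dist):
    """Return the merge order implied by a pairwise dissimilarity matrix.

    dist: symmetric matrix (list of lists) of pairwise alignment costs.
    Uses average linkage as a simple stand-in for neighbor joining.
    """
    clusters = {i: (i,) for i in range(len(dist))}
    merges = []

    def average(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        keys = sorted(clusters)
        # Pick the closest pair of clusters currently in the forest.
        a, b = min(
            ((p, q) for n, p in enumerate(keys) for q in keys[n + 1:]),
            key=lambda pq: average(clusters[pq[0]], clusters[pq[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges
```

The returned list gives the order in which pairwise alignments would be carried out: the closest sequences first, then progressively larger groups.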

4 Mutation-invariant metric 

Post-production transformations in video are analogous to mutations in biological 
DNA sequences and can be manifested either as insertion or deletion of visual 
nucleotides (indel mutations) as a result of temporal editing, or as substitution 
mutations, in which the visual content is replaced by another as the result of spatial 
editing such as resolution or aspect ratio change, cropping, compression artifacts, 
overlay of subtitles or channel logo, etc. While local alignment is efficient in 
coping with insertion or deletion mutations by proper selection of the gap penalty, 
substitution mutations can be a major challenge, as they may have a global effect 
on the entire video DNA sequence (imagine, for example, that due to non-uniform 
scaling of the video, the bag of features changes in every frame). 

In biological DNA sequence analysis, the exact mechanism of mutations is not
completely understood or reproduced; therefore, empirical models of nucleotide 
mutation probability are used [ ]. In our case, on the other hand, it is easy to repro- 
duce the post-production processing that causes mutation in video DNA. Ideally, 
our visual nucleotides should be discriminative (such that two intervals belong- 
ing to different videos are dissimilar) and invariant (such that two transformations 
of the same interval are similar). Though our construction of visual nucleotides
relies on feature descriptors that are insensitive to certain transformations of the
frame (scale, mild brightness and contrast variations), other transformations (e.g. 
cropping, subtitle overlay, etc.) may result in different visual nucleotides. As 
a consequence, the simple Euclidean metric would not be invariant under such 
transformations. 

Yet, it is possible to learn the best mutation-invariant metric between nucleotides
on a training set. Assume that we are given a set of nucleotides $X$
describing different intervals of video, and the class $\mathcal{T}$ of all transformations
to which invariance is desired. We denote by
$\mathcal{P} = \{(x, x \circ \tau) : x \in X, \tau \in \mathcal{T}\}$
the set of all positive pairs (visual nucleotides of identical intervals, differing up
to some transformation), and by $\mathcal{N} \subset X \times X$ the set of all negative
pairs (visual nucleotides of different intervals). Negative pairs are modeled by sampling
numerous intervals from different videos, which are known to be distinct. For
positive pairs, we generate representative transformations from the class $\mathcal{T}$.
Our goal is to find a metric between nucleotides that ideally is as small as possible
on the set of positives and as large as possible on the set of negatives.
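
The construction of the two training sets can be sketched as below. In this illustration the transformations act directly on nucleotide vectors, whereas in practice they would be applied to the video before the nucleotides are computed; the helper `make_pairs` and its arguments are hypothetical.

```python
import random

def make_pairs(nucleotides, transforms, n_neg, seed=0):
    """Build positive and negative training pairs.

    nucleotides: descriptors of distinct video intervals.
    transforms: callables standing in for simulated post-production
    transformations (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    # Positives: the same interval before and after a transformation.
    positives = [(x, t(x)) for x in nucleotides for t in transforms]
    # Negatives: two different intervals sampled at random.
    negatives = []
    while len(negatives) < n_neg:
        i, j = rng.sample(range(len(nucleotides)), 2)
        negatives.append((nucleotides[i], nucleotides[j]))
    return positives, negatives
```

Sampling two distinct indices guarantees that a negative pair never contains the same interval twice, provided the input intervals are themselves distinct.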

Shakhnarovich [25] considered a metric parameterized as

$$ d_{A,b}(x, x') = d_{\mathbb{H}}(\mathrm{sign}(Ax + b), \mathrm{sign}(Ax' + b)), \qquad (2) $$

where

$$ d_{\mathbb{H}}(y, y') = \frac{n}{2} - \frac{1}{2} \sum_{i=1}^{n} \mathrm{sign}(y_i y'_i) \qquad (3) $$

is the Hamming metric in the $n$-dimensional Hamming space $\mathbb{H}^n = \{-1, +1\}^n$
of binary sequences of length $n$. $A$ and $b$ are an $n \times d$ matrix and an
$n \times 1$ vector, respectively, parameterizing the metric. Our goal is to find $A$
and $b$ such that $d_{A,b}$ reflects the desired similarity of pairs of visual
nucleotides $x, x'$ in the training set.
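
A direct transcription of this parameterization is sketched below, assuming the convention sign(0) = +1 (not specified in the text) and plain Python lists for $A$, $b$, and $x$. The names `embed` and `hamming` are hypothetical.

```python
def sign(v):
    # Assumption: zero is mapped to +1 so that every output lies in {-1, +1}.
    return 1 if v >= 0 else -1

def embed(A, b, x):
    """Project x with the rows of A, shift by b, and keep only the signs."""
    return [
        sign(sum(a * xi for a, xi in zip(row, x)) + bk)
        for row, bk in zip(A, b)
    ]

def hamming(y, yp):
    """Hamming distance between two {-1, +1} sequences of equal length.

    Equals n/2 minus half the sum of coordinate-wise sign products, which
    counts the coordinates where the two sequences differ.
    """
    n = len(y)
    return n / 2 - sum(sign(a * c) for a, c in zip(y, yp)) / 2
```

For instance, with three projections, two inputs whose embeddings are `[1, -1, -1]` and `[1, 1, 1]` are at Hamming distance 2.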

Ideally, we would like to achieve $d_{A,b}(x, x') < d_0$ for $(x, x') \in \mathcal{P}$, and
$d_{A,b}(x, x') > d_0$ for $(x, x') \in \mathcal{N}$, where $d_0$ is some threshold. In
practice, this is rarely achievable as the distributions of $d_{A,b}$ on $\mathcal{P}$ and
$\mathcal{N}$ have cross-talks responsible for false positives ($d_{A,b} < d_0$ on
$\mathcal{N}$) and false negatives ($d_{A,b} > d_0$ on $\mathcal{P}$). Thus, optimal $A$, $b$
should minimize these cross-talks,

(4)



In [25], Shakhnarovich proposed considering learning optimal parameters $A$, $b$
as a boosted binary classification problem, where $d_{A,b}$ acts as a strong binary
classifier, and each dimension of the linear projection $\mathrm{sign}(A_k x + b_k)$ can
be considered as a weak classifier. This way, the AdaBoost algorithm can be used to
progressively construct $A$ and $b$, which would be a greedy solution of (4). At the
$k$-th iteration, the $k$-th row of the matrix $A$ and the $k$-th element of the vector
$b$ are found by minimizing a weighted version of (4). Weights of false positive and false
negative pairs are increased, and weights of true positive and true negative pairs
are decreased, using the standard AdaBoost reweighting scheme [7]. While it is
difficult to find $A_k$ minimizing (4) because of the non-linearity, we found that the
minimizer of the exponential loss is related to another, simpler problem,

$$ \max_{A_k} \frac{A_k C_{\mathcal{N}} A_k^{\mathrm{T}}}{A_k C_{\mathcal{P}} A_k^{\mathrm{T}}}, \qquad (5) $$

where $C_{\mathcal{P}}$ and $C_{\mathcal{N}}$ are the covariance matrices of the positive and
negative pairs, respectively. It can be shown that $A_k$ maximizing (5) is the
generalized eigenvector corresponding to the largest eigenvalue of
$C_{\mathcal{N}} A_k^{\mathrm{T}} = \lambda_{\max} C_{\mathcal{P}} A_k^{\mathrm{T}}$. Since the
minimizers of (4) and (5) do not coincide exactly, in our implementation, we select a
subspace spanned by the largest ten eigenvectors, out of which the direction as well
as the threshold parameter $b$ minimizing the exponential loss are selected.
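
The link between a ratio-type objective and the generalized eigenproblem quoted above can be checked in a few lines. The sketch assumes that the simpler problem is a generalized Rayleigh quotient in the covariance matrices of the negative and positive pairs (an assumption about its exact form), that both matrices are symmetric, and that the positive-pair covariance is positive definite. Here $a = A_k^{\mathrm{T}}$ denotes the $k$-th row of $A$ written as a column vector.

```latex
J(a) = \frac{a^{\mathrm{T}} C_{\mathcal{N}}\, a}{a^{\mathrm{T}} C_{\mathcal{P}}\, a},
\qquad
\nabla J(a) = \frac{2}{a^{\mathrm{T}} C_{\mathcal{P}}\, a}
              \bigl( C_{\mathcal{N}}\, a - J(a)\, C_{\mathcal{P}}\, a \bigr) = 0
\;\Longrightarrow\;
C_{\mathcal{N}}\, a = J(a)\, C_{\mathcal{P}}\, a .
```

Every stationary point is therefore a generalized eigenvector, and the value of the quotient at that point equals the corresponding eigenvalue. The maximum is attained at the eigenvector with the largest eigenvalue $\lambda_{\max}$.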

There are a few advantages to the described approach. First, the metric $d_{A,b}$
is constructed to achieve the best discriminativity and invariance on the training
set. If the training set is sufficiently representative, such a metric generalizes well.
It can be used as $d_{\mathrm{nuc}}$ in the alignment and search algorithms described in
Section 3. Secondly, the projection itself has an effect of dimensionality reduction,
and results in a very compact representation of visual nucleotides as bitcodes (for
example, the frame shown in Figure 1 is represented by the hexadecimal word
223E9DF01ADB3E00). Such bitcodes can be efficiently stored and manipulated
in standard databases. Thirdly, modern CPU architectures allow very efficient
computation of Hamming distances using bit counting and SIMD instructions.
Since each of the bits can be computed independently, score computation in the
alignment algorithm can be further parallelized on multiple CPUs using either
shared or distributed memory. Due to the compactness of the bitcode representation,
search can be performed in memory (a single 8GB memory system
is sufficient to store about 300,000 hours of video with 1 second resolution for
$n = 64$).
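
Both the bit-counting distance and the memory estimate are easy to reproduce. The sketch below stores each nucleotide as a 64-bit integer and counts differing bits with an XOR followed by a population count. The helper names are hypothetical, and the memory figure assumes that only the raw 8-byte codes are stored, with no index overhead.

```python
def hamming64(a, b):
    """Number of differing bits between two bitcodes stored as integers."""
    return bin(a ^ b).count("1")

def hours_in_memory(mem_bytes, bits=64, step_sec=1.0):
    """Hours of video whose bitcodes fit in mem_bytes, one code per step."""
    codes = mem_bytes // (bits // 8)
    return codes * step_sec / 3600

code = 0x223E9DF01ADB3E00          # bitcode of the frame in Figure 1
noisy = code ^ 0b101               # the same code with two bits flipped
distance = hamming64(code, noisy)  # two differing bits
capacity = hours_in_memory(8 * 2**30)
```

With 8 GiB of memory and one 64-bit code per second, `capacity` comes out a little under 300,000 hours, in line with the estimate above.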

5 Results 

In the experimental validation, we worked with a database containing 1013 hours
of assorted video content (movies, 2D and 3D cartoons, talk shows, sports) taken
from DVDs. Video DNA sequences were computed with parameters $T = 2$ sec
and $\Delta T = 1$ sec. A Hamming space of dimension $n = 64$ was used for the bitcode
representation. Metric learning was performed offline on a training set containing
$2 \times 10^5$ positive and $8 \times 10^5$ negative pairs. Positives were created using
transformations simulated with the AviSynth frame server.

Large scale search. For the evaluation of search and alignment, we used a 
scheme proposed by [ ]. Randomly selected short sequences from the database 
were used as queries. The queries were constructed in such a way that there was 
exactly one correct match with the database. In BLAST and FASTA-type algo- 
rithms, the queries represent the short nucleotide sequences used to establish ini- 
tial matches. The queries underwent transformations (shown in Figure 2) typical 
for the video post-production, including spatial and pixel transformations (crop- 
ping, letter and pillar box, contrast and color balance, compression noise, resolu- 
tion and aspect ratio change, subtitle overlay) and temporal transformations (fram- 
erate change and time shift). Each transformation appeared at multiple strengths 
(denoted as 1-3). 

Short sequences locally matching the queries were found in the database
using a FASTA/BLAST-type algorithm described in Section 3. The matching precision
was measured as precision at a recall of 1, i.e., the percentage of correct
first matches. Matches were considered correct if they were within a 1 sec tolerance
of the ground-truth match (i.e., falling within the temporal resolution of our
representation). Typical search time was 250 msec.
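
The evaluation criterion can be written down directly. The sketch assumes that each query yields a single first-match timestamp and that ground-truth timestamps are given in seconds; `precision_at_first_match` is a hypothetical helper, not part of the experimental code.

```python
def precision_at_first_match(first_matches, ground_truth, tol=1.0):
    """Percentage of queries whose first match lies within tol seconds
    of the ground-truth position."""
    hits = sum(
        abs(match - truth) <= tol
        for match, truth in zip(first_matches, ground_truth)
    )
    return 100.0 * hits / len(ground_truth)
```

For four queries of which three land within one second of the true position, the function returns 75.0.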

Table 1 shows the breakdown of search precision according to transformation
types and strengths. 10,770 queries of 10 sec length were used. Table 2 shows
the search precision as a function of the query length (varying from 5 to 30 sec),
on a query set of 20,160 queries, including all transformations of strength 1-3.
It shows that 10 sec of video are sufficient to achieve less than 3% search error
in a database of 1013 hours across versions including significant transformations.
This number falls below 1% for a 20 sec query.

Figure 2: Examples of transformations used in our experiments. Top: geometric
transformations (non-uniform scale, cropping, letter box and borders). Bottom: pixel
transformations (gamma, blur, quantization, subtitles).

Local alignment. In order to evaluate the performance of local alignment, we
performed alignment of sequences from a subset of the database containing
approximately 300 hours of video using the dynamic programming algorithm described
in Section 3. Query sequences underwent spatial transformations from the previous
experiments, as well as different temporal transformations. The latter included
deletion of portions of video, substitution with other videos, and insertion
of blackness periods (with either sharp or gradual fade-in and fade-out transitions
of different durations); local speeding up and slowing down of the video playback
speed; and removal of significant parts of the original footage from the query
sequence. Table 3 shows the breakdown of alignment precision according to
transformation types and strengths. An example of two aligned versions of a sequence
from the Desperate Housewives series is shown in Figure 3.

Phylogenetic analysis. Figure 4 shows a dendrogram representing the evolutionary
relations between six versions derived from the Desperate Housewives
sequence in Figure 3. Version x.y was obtained by removing a shot from sequence x.
The dendrogram was constructed from the matrix of pairwise sequence distances
(computed as the ratio of the gap length to the total sequence length in aligned
pairs of sequences) using the neighbor-joining approach. One can clearly see how
subsequent versions were derived.

Figure 3: Top: two versions of a video have different timelines because of editing.
Bottom: alignment based on Video DNA brings the two timelines in correspondence. 



Table 1: Precision (percentage of first matches falling within 1 sec tolerance) broken
down according to transformation types and strengths.

                          Strength
Transform.             1         2         3
Blur              100.00    100.00    100.00
Soften            100.00     98.81     98.81
Sharpen           100.00    100.00    100.00
Brighten          100.00    100.00     99.21
Darken            100.00     99.60     95.63
Contrast           99.80    100.00     98.21
Saturation        100.00     98.81     91.47
Quantization       88.89     86.90     90.08
Overlay           100.00     99.91     98.41
Crop               98.77     95.59     89.95
Letterbox          99.12     98.59     97.53
Nonunif. scale     98.41     99.21     96.56
Uniform scale     100.00    100.00    100.00
Framerate         100.00    100.00    100.00
Time shift        100.00     99.91     99.63

Figure 4: Phylogenetic analysis of six versions of Desperate Housewives. Each leaf in
the dendrogram represents a sequence, labeled according to its version (e.g., 1.1 means
the sequence was derived from sequence 1 by means of removing a shot). Vertical axis
represents the distance between the versions, computed as the percentage of dissimilar 
parts (gaps). Evolutionary relations between versions can be clearly inferred from the 
dendrogram. 

Table 2: Precision (percentage of first matches falling within 1 sec tolerance) as a function
of query length in seconds.

Query length    5 sec    10 sec    15 sec    20 sec    30 sec
Precision       93.7     97.08     98.36     99.22     99.38



Table 3: Precision (percentage of first matches within 1 sec tolerance) broken down
according to transformation types and strengths.

                               Strength
Transformation              1        2        3
Deletion & insertion    99.99    99.96    99.81
Partial                 99.96    99.90    99.34
Fade ins & outs         99.92    99.16    99.40
Local speed changes     99.91    99.90    99.90
Substitutions           98.45    95.67    91.01
Overlay                 99.85    99.62    99.28



6 Conclusions 

We presented a framework for the construction of robust and compact video rep- 
resentations. By appealing to the analogy between genetic sequences and video, 
we employed bioinformatics algorithms that allow efficient search and alignment 
of video sequences. Also, we showed that using metric learning, it is possible to 
design an optimal metric on a training set of generated video transformations. 

We believe that harvesting video and related metadata available in the pub- 
lic domain and creating a database of annotated video DNA sequences together 
with search and alignment tools could eventually have an impact similar to that 
of the Human Genome project in genomic research. Having, for example, a large 
database containing signatures of the most popular Hollywood movies would al- 
low identifying and synchronizing any version of a movie no matter when, where, 
and from which source it is played. The database can be used for finding copies 
and versions of movies on the web, in order to cope with piracy, enhance video 
content with metadata such as subtitles, or provide keywords for contextual
advertisement engines. Finally, human annotations and semantic information would
enable video understanding by using matching annotations of similar videos from 
the database. 